Completeness score (clustering)#

Completeness answers:

“Do all samples of each true class end up in the same predicted cluster?”

  • If a class is split across multiple clusters → completeness goes down.

  • If multiple classes are merged into one cluster → completeness stays high (merging is not penalized).

This is why completeness is typically reported together with homogeneity (or combined as V-measure).
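A two-line sanity check with scikit-learn makes both behaviors concrete (the toy labels here are chosen purely for illustration):

```python
from sklearn.metrics import completeness_score

y_true = [0, 0, 1, 1]
print(completeness_score(y_true, [0, 1, 2, 2]))  # class 0 is split across clusters 0 and 1 -> below 1
print(completeness_score(y_true, [0, 0, 0, 0]))  # both classes merged into one cluster -> 1.0
```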


Learning goals#

By the end you should be able to:

  • interpret completeness in plain language

  • derive it from entropy / conditional entropy

  • implement it from scratch in NumPy

  • visualize how different labelings affect the score

  • use it to tune a simple clustering model (with a caveat)

Quick import#

from sklearn.metrics import completeness_score

Notation (quick)#

  • True class labels: \(C\) (a categorical variable)

  • Predicted cluster labels: \(K\) (a categorical variable)

  • \(n\) = number of samples

  • Contingency table: \(n_{ck}\) = #samples of class \(c\) assigned to cluster \(k\)
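scikit-learn exposes the contingency table directly, which is handy for inspecting \(n_{ck}\) by hand (small made-up labelings for illustration):

```python
from sklearn.metrics.cluster import contingency_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]

# rows = true classes (0, 1, 2), columns = predicted clusters (0, 1, 2)
print(contingency_matrix(y_true, y_pred))
```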

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.metrics import completeness_score as skl_completeness_score
from sklearn.metrics import homogeneity_score as skl_homogeneity_score
from sklearn.metrics import v_measure_score as skl_v_measure_score

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

rng = np.random.default_rng(42)
np.set_printoptions(precision=4, suppress=True)

1) Intuition: completeness cares about splitting classes#

A good clustering is often described by two complementary properties:

  • Homogeneity: each cluster contains (mostly) one class.

  • Completeness: each class is contained (mostly) in one cluster.

Completeness ignores cluster names: relabeling clusters (e.g. swapping cluster IDs 0 and 1) does not change the score.

A useful diagnostic is the contingency table (a.k.a. class–cluster confusion matrix): rows = true classes, columns = predicted clusters.
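Label invariance is easy to verify: permuting the cluster IDs leaves the score unchanged (toy labels for illustration):

```python
from sklearn.metrics import completeness_score

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
y_swapped = [1 - k for k in y_pred]  # swap cluster IDs 0 and 1

# both calls produce the same score
print(completeness_score(y_true, y_pred))
print(completeness_score(y_true, y_swapped))
```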

2) The math (entropy view)#

Let \(n_{ck}\) be the contingency table and \(n\) the total number of samples.

Define probabilities:

  • \(p(c,k) = n_{ck}/n\)

  • \(p(c) = n_c/n\) where \(n_c = \sum_k n_{ck}\)

  • \(p(k) = n_k/n\) where \(n_k = \sum_c n_{ck}\)

Entropy of the cluster labeling:

\[ H(K) = -\sum_{k} p(k)\log p(k) \]

Conditional entropy of clusters given the true classes:

\[ H(K\mid C) = -\sum_{c} p(c) \sum_k p(k\mid c)\log p(k\mid c) \]

with \(p(k\mid c) = n_{ck}/n_c\).

Completeness is:

\[\begin{split} \text{completeness}(C,K) = \begin{cases} 1 - \frac{H(K\mid C)}{H(K)} & \text{if } H(K) > 0 \\ 1 & \text{if } H(K)=0 \end{cases} \end{split}\]

Notes:

  • The log base does not matter (it cancels in the ratio).

  • \(H(K)=0\) happens when all samples are put in a single cluster (a degenerate but “complete” labeling).

Since mutual information is \(I(C;K) = H(K) - H(K\mid C)\), you can also write:

\[ \text{completeness}(C,K) = \frac{I(C;K)}{H(K)}. \]
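This identity can be checked numerically with scikit-learn's `mutual_info_score` (which, like the formulas above, uses natural logs); the small labeling below is arbitrary:

```python
import numpy as np
from sklearn.metrics import completeness_score, mutual_info_score

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

# H(K) from the cluster counts (natural log)
_, counts = np.unique(y_pred, return_counts=True)
p_k = counts / counts.sum()
h_k = -np.sum(p_k * np.log(p_k))

# I(C;K) / H(K) should match completeness_score up to floating point
print(mutual_info_score(y_true, y_pred) / h_k)
print(completeness_score(y_true, y_pred))
```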
def contingency_matrix_np(y_true, y_pred):
    """Contingency table n_{ck} with rows=classes, cols=clusters."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape[0] != y_pred.shape[0]:
        raise ValueError("y_true and y_pred must have the same length")
    if y_true.ndim != 1 or y_pred.ndim != 1:
        raise ValueError("y_true and y_pred must be 1D arrays")

    classes, class_idx = np.unique(y_true, return_inverse=True)
    clusters, cluster_idx = np.unique(y_pred, return_inverse=True)

    contingency = np.zeros((classes.size, clusters.size), dtype=np.int64)
    np.add.at(contingency, (class_idx, cluster_idx), 1)
    return contingency, classes, clusters


def entropy_from_counts(counts):
    """Entropy H(X) from counts (natural log)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total <= 0:
        return 0.0
    probs = counts[counts > 0] / total
    return float(-np.sum(probs * np.log(probs)))


def conditional_entropy_from_contingency(contingency, *, given="rows"):
    """Conditional entropy from a contingency table.

    - given="rows": H(columns | rows)   -> H(K | C)
    - given="cols": H(rows | columns)  -> H(C | K)
    """
    contingency = np.asarray(contingency, dtype=float)
    n = contingency.sum()
    if n <= 0:
        return 0.0

    if given == "rows":
        weights = contingency.sum(axis=1)
        entropies = [entropy_from_counts(row) for row in contingency]
    elif given == "cols":
        weights = contingency.sum(axis=0)
        entropies = [entropy_from_counts(col) for col in contingency.T]
    else:
        raise ValueError("given must be 'rows' or 'cols'")

    weights = weights / n
    return float(np.sum(weights * np.array(entropies)))


def completeness_score_np(y_true, y_pred):
    contingency, _, _ = contingency_matrix_np(y_true, y_pred)

    h_k = entropy_from_counts(contingency.sum(axis=0))
    if h_k == 0.0:
        return 1.0

    h_k_given_c = conditional_entropy_from_contingency(contingency, given="rows")
    return 1.0 - (h_k_given_c / h_k)


def homogeneity_score_np(y_true, y_pred):
    contingency, _, _ = contingency_matrix_np(y_true, y_pred)

    h_c = entropy_from_counts(contingency.sum(axis=1))
    if h_c == 0.0:
        return 1.0

    h_c_given_k = conditional_entropy_from_contingency(contingency, given="cols")
    return 1.0 - (h_c_given_k / h_c)


def v_measure_score_np(y_true, y_pred, *, beta=1.0):
    """V-measure = harmonic mean of homogeneity and completeness."""
    h = homogeneity_score_np(y_true, y_pred)
    c = completeness_score_np(y_true, y_pred)

    denom = beta * h + c
    if denom == 0.0:
        return 0.0
    return (1.0 + beta) * h * c / denom
# Toy labelings to build intuition
y_true = np.repeat([0, 1, 2], repeats=50)

# A) Perfect clustering, but cluster IDs are permuted
y_pred_permuted = np.repeat([1, 0, 2], repeats=50)

# B) Split class 0 across two clusters (hurts completeness)
y_pred_split = np.concatenate(
    [np.repeat([0, 1], repeats=[25, 25]), np.repeat(2, 50), np.repeat(3, 50)]
)

# C) Put everything into one cluster (degenerate, but completeness==1 by convention)
y_pred_one_cluster = np.zeros_like(y_true)

cases = {
    "A) permuted (perfect)": y_pred_permuted,
    "B) split one class": y_pred_split,
    "C) one cluster": y_pred_one_cluster,
}

for name, y_pred in cases.items():
    c_np = completeness_score_np(y_true, y_pred)
    c_skl = skl_completeness_score(y_true, y_pred)
    h_np = homogeneity_score_np(y_true, y_pred)
    h_skl = skl_homogeneity_score(y_true, y_pred)
    v_np = v_measure_score_np(y_true, y_pred)
    v_skl = skl_v_measure_score(y_true, y_pred)

    print(name)
    print(f"  completeness  numpy={c_np:.4f}  sklearn={c_skl:.4f}")
    print(f"  homogeneity   numpy={h_np:.4f}  sklearn={h_skl:.4f}")
    print(f"  v-measure     numpy={v_np:.4f}  sklearn={v_skl:.4f}")
    print()
A) permuted (perfect)
  completeness  numpy=1.0000  sklearn=1.0000
  homogeneity   numpy=1.0000  sklearn=1.0000
  v-measure     numpy=1.0000  sklearn=1.0000

B) split one class
  completeness  numpy=0.8262  sklearn=0.8262
  homogeneity   numpy=1.0000  sklearn=1.0000
  v-measure     numpy=0.9049  sklearn=0.9049

C) one cluster
  completeness  numpy=1.0000  sklearn=1.0000
  homogeneity   numpy=0.0000  sklearn=0.0000
  v-measure     numpy=0.0000  sklearn=0.0000
def plot_contingency_heatmap(y_true, y_pred, *, title):
    contingency, classes, clusters = contingency_matrix_np(y_true, y_pred)
    fig = go.Figure(
        data=go.Heatmap(
            z=contingency,
            x=[str(k) for k in clusters],
            y=[str(c) for c in classes],
            colorscale="Blues",
            showscale=False,
        )
    )

    for i, c in enumerate(classes):
        for j, k in enumerate(clusters):
            fig.add_annotation(
                x=str(k),
                y=str(c),
                text=str(int(contingency[i, j])),
                showarrow=False,
                font=dict(color="black"),
            )

    fig.update_layout(
        title=title,
        xaxis_title="predicted cluster",
        yaxis_title="true class",
        margin=dict(l=40, r=40, t=70, b=40),
    )
    return fig


for name, y_pred in cases.items():
    c = completeness_score_np(y_true, y_pred)
    fig = plot_contingency_heatmap(y_true, y_pred, title=f"{name} (completeness={c:.3f})")
    fig.show()
# How the score reacts when one class is gradually split

n0, n1 = 200, 200
y_true_2 = np.array([0] * n0 + [1] * n1)
base_pred = np.array([0] * n0 + [1] * n1)

fractions = np.linspace(0.0, 1.0, 41)
completenesses = []
homogeneities = []
v_measures = []

for f in fractions:
    y_pred = base_pred.copy()
    move = int(round(f * n0))
    idx = np.arange(n0)
    y_pred[idx[:move]] = 1  # move some of class 0 into cluster 1

    completenesses.append(completeness_score_np(y_true_2, y_pred))
    homogeneities.append(homogeneity_score_np(y_true_2, y_pred))
    v_measures.append(v_measure_score_np(y_true_2, y_pred))

fig = go.Figure()
fig.add_trace(go.Scatter(x=fractions, y=completenesses, mode="lines+markers", name="completeness"))
fig.add_trace(go.Scatter(x=fractions, y=homogeneities, mode="lines+markers", name="homogeneity"))
fig.add_trace(go.Scatter(x=fractions, y=v_measures, mode="lines+markers", name="v-measure"))
fig.update_layout(
    title="Splitting one class across clusters: completeness vs homogeneity",
    xaxis_title="fraction of class 0 moved into the other cluster",
    yaxis_title="score",
    yaxis=dict(range=[-0.02, 1.02]),
)
fig.show()

What the plot should make you notice#

  • When you start moving part of class 0 into another cluster, that class becomes split → completeness drops.

  • As you move all of class 0 into the other cluster, only one cluster remains in use, so \(H(K)=0\) and completeness returns to 1 by convention.

So completeness alone can be maximized by collapsing everything into a single cluster.

That’s why you typically report completeness together with homogeneity (or V-measure).
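Both endpoints of the sweep can be checked directly (tiny arrays for illustration):

```python
import numpy as np
from sklearn.metrics import completeness_score

y_true = np.repeat([0, 1], 4)

print(completeness_score(y_true, np.repeat([0, 1], 4)))      # nothing moved: 1.0
print(completeness_score(y_true, [0, 0, 1, 1, 1, 1, 1, 1]))  # half of class 0 moved: below 1
print(completeness_score(y_true, np.ones(8, dtype=int)))     # everything moved: one cluster, 1.0
```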

3) Using completeness to optimize a simple algorithm (k-means model selection)#

Completeness is an external clustering metric: it requires ground-truth class labels.

So you usually use it for evaluation or for model selection when you do have labels (e.g. benchmarking different clustering methods, tuning the number of clusters \(k\), etc.).

Important caveat:

  • If you blindly maximize completeness over \(k\), the best answer can be \(k=1\) (the degenerate “everything in one cluster” solution).

  • Maximizing completeness tends to favor fewer clusters, because merging classes is not penalized.

  • In practice you either:

    • constrain \(k\) to a sensible range (e.g. \(k \ge 2\), or \(k\) near the expected number of classes), and/or

    • optimize/report V-measure instead of completeness alone.

# Synthetic 2D dataset with known classes

centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 3.5]])
n_per_class = 200
std = 0.8

X = np.vstack([rng.normal(loc=center, scale=std, size=(n_per_class, 2)) for center in centers])
y_true_blobs = np.repeat(np.arange(centers.shape[0]), repeats=n_per_class)

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=y_true_blobs.astype(str),
    title="Synthetic data (true classes)",
    labels={"x": "x1", "y": "x2", "color": "true class"},
)
fig.show()
def kmeans_fit_predict_np(X, k, *, n_init=10, max_iter=100, tol=1e-6, rng=rng):
    """A small NumPy k-means (Lloyd's algorithm) implementation.

    Returns: labels, centroids, inertia
    """
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("X must be 2D")
    n_samples = X.shape[0]
    if not (1 <= k <= n_samples):
        raise ValueError("k must be in [1, n_samples]")

    best_inertia = np.inf
    best_labels = None
    best_centroids = None

    for _ in range(n_init):
        centroids = X[rng.choice(n_samples, size=k, replace=False)].copy()

        for _ in range(max_iter):
            # squared distances: shape (n_samples, k)
            d2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
            labels = np.argmin(d2, axis=1)

            new_centroids = centroids.copy()
            for j in range(k):
                mask = labels == j
                if not np.any(mask):
                    new_centroids[j] = X[rng.integers(0, n_samples)]
                else:
                    new_centroids[j] = X[mask].mean(axis=0)

            shift = np.max(np.linalg.norm(new_centroids - centroids, axis=1))
            centroids = new_centroids
            if shift < tol:
                break

        # Final assignment and inertia, both computed against the final centroids
        # (the labels from inside the loop were assigned before the last centroid update)
        d2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d2, axis=1)
        inertia = float(np.sum(np.min(d2, axis=1)))

        if inertia < best_inertia:
            best_inertia = inertia
            best_labels = labels
            best_centroids = centroids

    return best_labels, best_centroids, best_inertia
# "Optimize" k-means by selecting k that maximizes completeness (and compare to v-measure)

ks = np.arange(1, 9)
results = []

for k in ks:
    labels, _, inertia = kmeans_fit_predict_np(X, k, n_init=8, max_iter=60)
    c = completeness_score_np(y_true_blobs, labels)
    v = v_measure_score_np(y_true_blobs, labels)
    results.append((k, c, v, inertia))

results = np.array(results, dtype=float)
k_vals = results[:, 0]
c_vals = results[:, 1]
v_vals = results[:, 2]
inertia_vals = results[:, 3]

fig = go.Figure()
fig.add_trace(go.Scatter(x=k_vals, y=c_vals, mode="lines+markers", name="completeness"))
fig.add_trace(go.Scatter(x=k_vals, y=v_vals, mode="lines+markers", name="v-measure"))
fig.update_layout(
    title="Model selection for k-means: completeness vs k (and v-measure)",
    xaxis_title="k (number of clusters)",
    yaxis_title="score",
    yaxis=dict(range=[-0.02, 1.02]),
)
fig.show()

fig = px.line(
    x=k_vals,
    y=inertia_vals,
    markers=True,
    title="k-means inertia vs k (lower is better for the k-means objective)",
    labels={"x": "k", "y": "inertia"},
)
fig.show()

best_k_unconstrained = int(k_vals[np.argmax(c_vals)])
best_k_k_ge_2 = int(k_vals[1:][np.argmax(c_vals[1:])])
best_k_by_v_measure = int(k_vals[np.argmax(v_vals)])

print(f"Best k by completeness (unconstrained): k={best_k_unconstrained}")
print(f"Best k by completeness (restrict k>=2): k={best_k_k_ge_2}")
print(f"Best k by v-measure: k={best_k_by_v_measure}")
Best k by completeness (unconstrained): k=1
Best k by completeness (restrict k>=2): k=2
Best k by v-measure: k=3
# Visualize the completeness-selected solution (restrict k>=2)

k_star = best_k_k_ge_2
labels_star, centroids_star, _ = kmeans_fit_predict_np(X, k_star, n_init=12, max_iter=80)

c_star = completeness_score_np(y_true_blobs, labels_star)
h_star = homogeneity_score_np(y_true_blobs, labels_star)
v_star = v_measure_score_np(y_true_blobs, labels_star)

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=labels_star.astype(str),
    title=f"k-means clustering (k={k_star})",
    labels={"x": "x1", "y": "x2", "color": "cluster"},
)
fig.add_trace(
    go.Scatter(
        x=centroids_star[:, 0],
        y=centroids_star[:, 1],
        mode="markers",
        marker=dict(symbol="x", size=12, color="black"),
        name="centroids",
    )
)
fig.show()

print(f"completeness={c_star:.4f}  homogeneity={h_star:.4f}  v-measure={v_star:.4f}")

fig = plot_contingency_heatmap(
    y_true_blobs,
    labels_star,
    title=f"Contingency table for k-means (k={k_star}, completeness={c_star:.3f})",
)
fig.show()
completeness=0.9594  homogeneity=0.5589  v-measure=0.7063

4) Pros, cons, and when to use it#

Pros#

  • Interpretable: directly measures “class splitting”.

  • Label-invariant: relabeling cluster IDs does not change the score.

  • Normalized: outputs a value in \([0,1]\).

Cons / pitfalls#

  • Doesn’t penalize merging classes: a single-cluster solution gets completeness \(=1\).

  • Needs ground truth labels: not usable as a purely unsupervised internal metric.

  • Not smooth/differentiable: hard to use as a direct training objective; it’s mainly for evaluation/model selection.

Good use cases#

  • Benchmarking clustering methods when you have ground truth (e.g. synthetic data, annotated datasets).

  • Model selection for clustering (choose hyperparameters) when labels are available.

  • As part of V-measure (together with homogeneity).

Exercises#

  1. Modify the toy examples to create a case with high completeness but low homogeneity.

  2. Verify that homogeneity is completeness with the roles of classes and clusters swapped: check that homogeneity_score_np(y_true, y_pred) equals completeness_score_np(y_pred, y_true).

  3. Compare completeness to other external clustering metrics: ARI (Adjusted Rand Index) and NMI (Normalized Mutual Information).
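As a starting point for exercise 3: unlike completeness, both ARI and NMI score the degenerate one-cluster solution at 0 (labels reused from the toy setup above):

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             normalized_mutual_info_score)

y_true = np.repeat([0, 1, 2], 50)
y_one_cluster = np.zeros_like(y_true)

print(completeness_score(y_true, y_one_cluster))            # 1.0 by convention
print(adjusted_rand_score(y_true, y_one_cluster))           # 0.0
print(normalized_mutual_info_score(y_true, y_one_cluster))  # 0.0
```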

References#

  • scikit-learn docs: https://scikit-learn.org/stable/api/sklearn.metrics.html

  • Rosenberg & Hirschberg (2007): “V-measure: A conditional entropy-based external cluster evaluation measure”